
🐛 gracefully shutdown reconcilers and catalogd FBC server #1688

Closed

Conversation

joelanford (Member)

Description

This PR adds a new context utility package with a new kind of context that is cancelled some duration after its parent context is cancelled. This is useful for reconcilers and servers where having some extra time to complete can decrease the chance of interruptions.
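For illustration only, here is a minimal sketch of what such a utility could look like; the package and function names are hypothetical and not necessarily the PR's actual API:

```go
package contextutil

import (
	"context"
	"time"
)

// WithDelayedCancel returns a context that is cancelled `delay` after parent is
// cancelled, or immediately when the returned CancelFunc is called. Values from
// parent are preserved via context.WithoutCancel (Go 1.21+).
func WithDelayedCancel(parent context.Context, delay time.Duration) (context.Context, context.CancelFunc) {
	ctx, cancel := context.WithCancel(context.WithoutCancel(parent))
	go func() {
		select {
		case <-parent.Done():
			// Parent was cancelled: start the grace period.
			timer := time.NewTimer(delay)
			defer timer.Stop()
			select {
			case <-timer.C:
			case <-ctx.Done():
			}
			cancel()
		case <-ctx.Done():
			// The derived context was cancelled directly; nothing more to do.
		}
	}()
	return ctx, cancel
}
```

A caller (e.g. server or reconciler setup code) would wrap the signal context, say `ctx, cancel := contextutil.WithDelayedCancel(signalCtx, 25*time.Second)`, so that in-flight work keeps a short window to finish after the stop signal arrives.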

The particular motivation for this is a case we've seen in e2e tests where the operator-controller pod is terminated while it is in the middle of an installation or upgrade. If that process is interrupted, it leaves the underlying installation in an intermediate state that requires human intervention to recover.

While this change does not specifically try to make it possible to progress from this intermediate state, it does make it less likely for the interruption to happen during this critical section.

How an unfinished installation/upgrade should be treated is an open question of the project's design philosophy. The current stance is "let a human decide what to do", but that is not a reasonable answer at scale, so I think we'll need to think about further improvements to help OLM know when a roll forward (or back) is safe.

Reviewer Checklist

  • API Go Documentation
  • Tests: Unit Tests (and E2E Tests, if appropriate)
  • Comprehensive Commit Messages
  • Links to related GitHub Issue(s)

@joelanford joelanford requested a review from a team as a code owner January 31, 2025 21:23
@joelanford joelanford changed the title 🐛 gracefully shutdown reconcilers and catalogd FBC 🐛 gracefully shutdown reconcilers and catalogd FBC server Jan 31, 2025

netlify bot commented Jan 31, 2025

Deploy Preview for olmv1 ready!

  • 🔨 Latest commit: ad9f130
  • 🔍 Latest deploy log: https://app.netlify.com/sites/olmv1/deploys/679f5f4ffc7c670008fcd7b7
  • 😎 Deploy Preview: https://deploy-preview-1688--olmv1.netlify.app

@joelanford joelanford added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 31, 2025
@joelanford (Member Author) commented Jan 31, 2025

I added the do-not-merge/work-in-progress label because maintainers should discuss two things: whether this is a change we want to make, and whether there is a deeper problem we still want to solve around recovering from interruptions.

codecov bot commented Jan 31, 2025

Codecov Report

Attention: Patch coverage is 90.47619% with 8 lines in your changes missing coverage. Please review.

Project coverage is 68.21%. Comparing base (9b08aea) to head (ad9f130).
Report is 83 commits behind head on main.

Files with missing lines Patch % Lines
...rnal/controllers/core/clustercatalog_controller.go 0.00% 6 Missing ⚠️
catalogd/internal/serverutil/serverutil.go 0.00% 1 Missing ⚠️
...nternal/controllers/clusterextension_controller.go 83.33% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1688      +/-   ##
==========================================
+ Coverage   67.74%   68.21%   +0.46%     
==========================================
  Files          57       58       +1     
  Lines        4620     4697      +77     
==========================================
+ Hits         3130     3204      +74     
- Misses       1265     1268       +3     
  Partials      225      225              
Flag Coverage Δ
e2e 53.37% <54.54%> (-0.08%) ⬇️
unit 55.18% <89.28%> (+0.68%) ⬆️

Flags with carried forward coverage won't be shown.


@azych (Contributor) commented Feb 3, 2025

I'm not sure about this. I worry that using those delayed contexts, with the additional waiting times, might make things more difficult to debug and understand in the future.

Alternatively, maybe we could try to solve this by adding (additional) cleanup logic that runs on parent context cancellation for the components that might need it, and then placing a single 'global' timeout (e.g. in main.go) so that this logic can fire when we get the interrupt signal? WDYT?

@joelanford (Member Author) commented Feb 4, 2025

Yeah, things could get out of hand if these delayed contexts exist all over the place, I agree.

But I'm not sure cleanup logic is the right concept either. For one, we would generally want cleanup logic related to a particular CR object to run as part of reconcile for better recovery mechanics. But also, in the case of an interrupted helm release, there is no safe cleanup after the helm release has been interrupted unless there is some signal from the chart author, the user, or both.

So I'm trying to think through a mechanic where:

  1. we stop pulling items off the controller queues
  2. we allow any in-flight reconciles to continue for some period of time (a rough sketch of this idea follows below). It would be interesting if we had some metrics history for a histogram of reconcile duration. I know we have a metric for that, but I don't think we have any historical data.
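One hypothetical way to approximate point 1 is a reconciler wrapper (none of these names exist in the codebase): once a drain channel is closed, new requests are requeued without doing any work, while reconciles already running continue under their own (possibly delayed) context.

```go
package graceful

import (
	"context"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// DrainingReconciler stops starting new work once Draining is closed,
// leaving queued requests to be retried later (or by the next pod).
type DrainingReconciler struct {
	Inner    reconcile.Reconciler
	Draining <-chan struct{} // closed when shutdown begins
}

func (r *DrainingReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	select {
	case <-r.Draining:
		// Shutdown has begun: don't enter the critical section; ask for the item back later.
		return ctrl.Result{RequeueAfter: 5 * time.Second}, nil
	default:
	}
	// Not draining: delegate to the real reconciler.
	return r.Inner.Reconcile(ctx, req)
}
```

This doesn't literally stop the workqueue from being drained, but it would keep new reconciles out of the critical section while in-flight ones finish.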

Also, I'm not at all attached to this PR, and I can definitely see how it has downsides. Another downside is that it could make issues due to interruptions even harder to find and reproduce.

This PR is mainly a lever for getting a conversation going, so I'm interested to see where this leads!

@joelanford (Member Author)

Also of note, I just looked at our helm usage, and tracing through helm applier -> helm-operator-plugins -> helm, we are using the helm install/upgrade variants that use context.Background(), so I suspect that the existing code keeps running for up to terminationGracePeriodSeconds after the stop signal is received by the process.

Maybe we should switch over to Helm's RunWithContext methods in helm-operator-plugins. But that would risk making things worse if we don't have some other mitigation in place.
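To make the distinction concrete, a hedged sketch of the two Helm call styles (this is not the applier's actual code; cfg, chrt, and vals are placeholders):

```go
package example

import (
	"context"

	"helm.sh/helm/v3/pkg/action"
	"helm.sh/helm/v3/pkg/chart"
	"helm.sh/helm/v3/pkg/release"
)

func install(ctx context.Context, cfg *action.Configuration, chrt *chart.Chart, vals map[string]interface{}) (*release.Release, error) {
	client := action.NewInstall(cfg)
	client.ReleaseName = "example" // placeholder
	client.Namespace = "default"   // placeholder

	// client.Run(chrt, vals) delegates to RunWithContext with context.Background(),
	// so the install keeps going until terminationGracePeriodSeconds ends the process.
	// RunWithContext instead observes cancellation of the caller's context.
	return client.RunWithContext(ctx, chrt, vals)
}
```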

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 4, 2025
@openshift-merge-robot

PR needs rebase.


@azych (Contributor) commented Feb 4, 2025

I think a good start might be to go through all of the reconciliation logic and try to identify the places where an interruption could mean entering this intermediate state. Once we have that, we should better understand the scope of the problem and can think about what we can do in each specific case to mitigate or clean up any side effects back to a working state, or whether there is anything we can actually do at all. Also, I fully agree about taking into account any metrics we might have, or adding new ones if we need a better understanding.

Just thinking out loud, but one rough idea could be to keep track of how many reconciliations are in those critical sections and communicate this outside of the reconciliation loop and the specific controller, so that upon receiving the interruption signal we could, for example, adjust the behavior/timeout on a more "global" level, which should be easier to reason about and debug.

Update:
so to have something done right now, we could technically:

  • pass to the manager a separate context from the one we get from ctrl.SetupSignalHandler()
  • catch the interruption via ctx := ctrl.SetupSignalHandler() like we currently do, but then allow a custom grace-period timeout before we cancel the separate context we passed to the manager (rough sketch below)
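A rough sketch of what that could look like in main.go; the grace period value and the overall structure here are purely illustrative:

```go
package main

import (
	"context"
	"os"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

const shutdownGracePeriod = 25 * time.Second // illustrative value

func main() {
	// Signal context: cancelled as soon as SIGTERM/SIGINT arrives.
	signalCtx := ctrl.SetupSignalHandler()

	// Separate context that actually governs the manager's lifecycle.
	mgrCtx, mgrCancel := context.WithCancel(context.Background())
	defer mgrCancel()

	go func() {
		<-signalCtx.Done()
		// Give in-flight work a grace period before stopping the manager.
		time.Sleep(shutdownGracePeriod)
		mgrCancel()
	}()

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
	if err != nil {
		os.Exit(1)
	}
	if err := mgr.Start(mgrCtx); err != nil {
		os.Exit(1)
	}
}
```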

The main issue here is that the context we pass to the manager at the start governs all of its operations, including the lifecycle of its controllers, so I'm not sure what options we would have for stopping new reconciliation events in this scenario (or generally, tbh) outside of this context. From what I've seen there are options like unmanaged controllers, which are not automatically added to the manager, or more granular context separation, though those might lead to similar issues with complexity and decreased debuggability.

@joelanford joelanford closed this Feb 27, 2025